library(gganimate)
library(gifski)
library(tidyverse)
library(jpeg)
library(pacman)
library(grid)
library(broom)
library(gridExtra)
significant_regressors <- function(y, X, signif_level) {
# returns a vector of significant regressor indices
}
remove_insignificant <- function(y, X, signif_level) {
# removes insignificant regressors from X matrix
# based on lm(y~X) and given significance level
}
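A minimal sketch of what these two helpers might look like (one possible implementation, assuming `X` is a numeric matrix and the \(p\)-values come from `summary(lm(y ~ X))`; the `_sketch` suffix marks these as illustrative, not the original code):

```r
# Indices of the columns of X significant at signif_level in lm(y ~ X);
# aliased (NA) coefficients are treated as insignificant.
significant_regressors_sketch <- function(y, X, signif_level) {
  fit <- lm(y ~ X)
  cf <- coef(fit)[-1]                      # drop the intercept
  pvals <- rep(NA_real_, length(cf))
  pvals[!is.na(cf)] <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
  which(!is.na(pvals) & pvals < signif_level)
}

# Keep only the significant columns of X.
remove_insignificant_sketch <- function(y, X, signif_level) {
  keep <- significant_regressors_sketch(y, X, signif_level)
  X[, keep, drop = FALSE]
}
```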
algorithm <- function(m, n, nsteps = 4) {
# Algorithm for steps 1-5 (possibly less)
# returns number of significant regressors after nsteps
}
simulation <- function(n, nsteps) {
# Applies the algorithm to randomly generated data
}
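A sketch of both functions under assumptions the stubs only hint at: the data are pure noise, each step refits `lm(y ~ X)` and keeps the regressors with \(p\)-value below 0.5 (the Step 2 threshold discussed in this report), and the final count uses level 0.05. The sizes `m = 50`, `n = 100` inside `simulation_sketch` are hypothetical placeholders:

```r
# m pure-noise regressors, n observations; nsteps screening passes at
# p < 0.5, then count the survivors significant at level 0.05.
algorithm_sketch <- function(m, n, nsteps = 4) {
  pvalues <- function(y, X) {
    fit <- lm(y ~ X)
    cf <- coef(fit)[-1]                  # aliased (NA) coefficients -> p = 1
    p <- rep(1, length(cf))
    p[!is.na(cf)] <- summary(fit)$coefficients[-1, "Pr(>|t|)"]
    p
  }
  X <- matrix(rnorm(n * m), nrow = n)
  y <- rnorm(n)
  for (step in seq_len(nsteps)) {
    keep <- pvalues(y, X) < 0.5
    if (!any(keep)) return(0)
    X <- X[, keep, drop = FALSE]
  }
  sum(pvalues(y, X) < 0.05)
}

# n here is the number of independent replications of the whole experiment
simulation_sketch <- function(n, nsteps) {
  replicate(n, algorithm_sketch(m = 50, n = 100, nsteps = nsteps))
}
```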
number_signif_histogram <- function(nsteps = 4) {
# Draws histogram of no. significant regressors for
# nsteps of algorithm
}
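The plotting part could look roughly like this; the real function presumably runs the simulation internally, but the sketch takes a precomputed vector of counts so the plotting logic stands alone:

```r
library(ggplot2)

# Histogram of the simulated counts of significant regressors
number_signif_histogram_sketch <- function(counts, nsteps = 4) {
  ggplot(data.frame(count = counts), aes(x = count)) +
    geom_histogram(binwidth = 1, colour = "white") +
    labs(title = paste("nsteps =", nsteps),
         x = "number of significant regressors", y = "frequency")
}
```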
Task: Repeat this process many times and plot the distributions of the number of significant regressors in Step 5.
# how many regressors are significant at level 0.05 in Step 5 (nsteps = 4)?
num_signif <- simulation(1, 4)
num_signif
## [1] 11
#number_signif_histogram()
Task: Explain your findings.
Our findings are in line with the results of Freedman’s
simulations (1983). First, our setup matches his assumptions: the
number of randomly generated regressors is roughly comparable to the
number of data points.
Our \(R^2\) is high simply because we put many regressors into the
linear model, and a relatively large number of variables passed the
0.5 \(p\)-value threshold from Step 2.
When we exclude the variables that do not have \(p\)-value < 0.5, we lose some of the model’s explained variance purely by the composition of \(R^2\) (even though our model contains nothing but random noise), so \(R^2\) decreases. More interestingly, the structure and the number of significant variables also change: more variables attain higher statistical significance than before the \(p\)-value cutoff. This may be because the regressors can be (even if only weakly) correlated, so excluding one variable can raise the significance of another.
Similarly, when we repeat the whole procedure in Steps 3-5, \(R^2\) keeps decreasing and the number of variables shrinks, yet their significance rises. That significance is, however, biased by the previous selection steps, which were taken precisely to adjust the model.
Lesson to take: performing variable selection and inference on the same data invalidates the nominal significance levels; regressors can appear highly significant even when the data are pure noise.
Task: Compare the distributions of the numbers of significant regressors.
The number of significant regressors is higher when fewer steps of the algorithm are run, and the distribution appears approximately normal in each case.
# Distributions of the numbers of significant regressors in
# different steps
p1 <- number_signif_histogram(1)
p2 <- number_signif_histogram(2)
p3 <- number_signif_histogram(3)
p4 <- number_signif_histogram(4)
grid.arrange(p1, p2, p3, p4, ncol=2, nrow = 2)
We have \(y_i \sim Bin(n_i, p_i)\) and \(\text{ln}(\frac{p_i}{1-p_i}) = \beta_0 + \beta_1x_{i1}+\dots+\beta_px_{ip}\).
Marginal probability mass function (pmf): \[P(Y_i=y_i) = {n_i \choose y_i} p_i^{y_i}(1-p_i)^{n_i-y_i} \] \[p_i = \frac{e^{\theta_i}}{1+e^{\theta_i}}, \quad \theta_i = \text{ln}\biggl(\frac{p_i}{1-p_i}\biggr) = \beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}\]
Joint probability mass function: \[f(y_1, \ldots, y_n \space | \space n_i, p_i) = \prod_{i=1}^n{n_i \choose y_i}p_i^{y_i}(1-p_i)^{n_i-y_i}\] substitute \(p_i\) for \(\frac{e^{\theta_i}}{1+e^{\theta_i}}\):
\[ = \prod_{i=1}^n{n_i \choose y_i} \biggl[\frac{e^{\theta_i}}{1+e^{\theta_i}}\biggr]^{y_i}\biggl[1-\frac{e^{\theta_i}}{1+e^{\theta_i}}\biggr]^{n_i-y_i}\] Likelihood function:
\[L(\hat{\beta} \space | \space y_i, x_i, n_i) = \prod_{i=1}^n{n_i \choose y_i} \biggl[\frac{e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}{1+e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}\biggr]^{y_i}\biggl[1-\frac{e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}{1+e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}\biggr]^{n_i-y_i}\]
Log likelihood:
\[l(\hat{\beta}\space | \space y_i, x_i, n_i) = \text{ln}(L(\hat{\beta}\space | \space y_i, x_i, n_i))\] \[ = \sum_{i=1}^n \text{ln}\Biggl[{n_i \choose y_i} \biggl[\frac{\text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}{1+\text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}\biggr]^{y_i}\biggl[1-\frac{\text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}{1+\text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}\biggr]^{n_i-y_i}\Biggr]\] \[ = \sum_{i=1}^n\Biggl[\text{ln}{n_i \choose y_i} + y_i\Bigl(\text{ln}(e^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}) - \text{ln}(1 + \text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}})\Bigr) + (n_i-y_i)\Bigl(\text{ln}(1) - \text{ln}(1 + \text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}})\Bigr)\Biggr]\] \[ = \sum_{i=1}^n\Biggl[\text{ln}{n_i \choose y_i} + y_i(\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}) - n_i\cdot\text{ln}(1 + \text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}})\Biggr]\] \[ = \sum_{i=1}^n\Bigl[\text{ln}{n_i \choose y_i} + y_i \theta_i - n_i\,\text{ln}(1+\text{e}^{\theta_i})\Bigr] \]
Score function (the gradient of the log-likelihood; writing \(x_i = (1, x_{i1}, \ldots, x_{ip})^T\) and using \(\partial\theta_i/\partial\hat{\beta} = x_i\)): \[S(\hat{\beta}) = \frac{\partial{}}{\partial{\hat{\beta}}}l(\hat{\beta} \space | \space y_i, x_i, n_i) = \sum_{i=1}^n\biggl(y_i - \frac{n_i\text{e}^{\theta_i}}{1+\text{e}^{\theta_i}}\biggr)x_i\] \[ = \sum_{i=1}^n\biggl(y_i - \frac{n_i\text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}{1+\text{e}^{\beta_0 + \beta_1x_{i1} + \ldots + \beta_px_{ip}}}\biggr)x_i\]
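As a sanity check, the score \(\sum_i (y_i - n_i p_i)\,x_i\) can be compared against a central-difference numerical gradient of the log-likelihood on simulated data (base R only; all names and parameter values below are illustrative):

```r
set.seed(42)
N <- 200
X <- cbind(1, rnorm(N), rnorm(N))          # intercept plus two regressors
beta_true <- c(-0.5, 1, -1)
ntrials <- rep(10, N)
y <- rbinom(N, size = ntrials, prob = plogis(X %*% beta_true))

# Binomial-logit log-likelihood: sum of ln C(n_i, y_i) + y_i*theta_i
#                                 - n_i*ln(1 + exp(theta_i))
loglik <- function(beta) {
  theta <- as.vector(X %*% beta)
  sum(lchoose(ntrials, y) + y * theta - ntrials * log1p(exp(theta)))
}

# Analytic score in vector form: t(X) %*% (y - n * p), with p = plogis(theta)
score <- function(beta) {
  theta <- as.vector(X %*% beta)
  as.vector(t(X) %*% (y - ntrials * plogis(theta)))
}

beta0 <- c(0, 0, 0)
num_grad <- sapply(seq_along(beta0), function(j) {
  h <- 1e-6
  e <- replace(numeric(length(beta0)), j, h)
  (loglik(beta0 + e) - loglik(beta0 - e)) / (2 * h)
})
max(abs(num_grad - score(beta0)))          # close to zero
```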
# Render the animation `p` at 25 fps over 20 s, pausing on the final frame
animate(p, fps = 25, duration = 20, end_pause = 100, renderer = gifski_renderer())
confint_bootstrap <- function(nb, x, y, model) {
# Constructs a 95% confidence interval based on nonparametric bootstrap
# Does this nb-times as a simulation
# Returns the percentage of cases in which the CI covers the true value
}
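One way this simulation might look, as a sketch only: the true model \(y = 2x + \varepsilon\), the resample count `B`, and the sample size `n` are all assumptions, and the original `x`, `y`, `model` arguments are replaced by data generated inside each round. Each round builds a 95% percentile bootstrap CI for the slope and we report the fraction of rounds in which the CI covers the true slope:

```r
# Coverage of a percentile-bootstrap 95% CI for the slope, estimated
# over nb simulation rounds (beta_true, n, B are hypothetical defaults).
confint_bootstrap_sketch <- function(nb, n = 50, beta_true = 2, B = 200) {
  covered <- logical(nb)
  for (i in seq_len(nb)) {
    x <- rnorm(n)
    y <- beta_true * x + rnorm(n)
    # resample (x, y) pairs with replacement and refit the slope
    boot_slopes <- replicate(B, {
      idx <- sample.int(n, replace = TRUE)
      coef(lm(y[idx] ~ x[idx]))[2]
    })
    ci <- quantile(boot_slopes, c(0.025, 0.975))
    covered[i] <- ci[1] <= beta_true && beta_true <= ci[2]
  }
  mean(covered)                            # fraction of CIs covering beta_true
}
```

With pairs resampling the empirical coverage typically lands a little below the nominal 95%, which is consistent with the 0.908 reported above.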
nb <- 5000
confidence <- confint_bootstrap(nb, x, y, model)
confidence
## [1] 0.908